(54) BIOPOLYMER AUTOMATIC IDENTIFYING METHOD



(57) The invention aims to provide a highly accurate
automatic biopolymer determination technique utilizing
mass spectrometry whereby calibration prior to measurement
or the addition of an internal standard to a sample
can be eliminated. The biopolymer automatic identifying
method of the invention comprises: retrieving a
candidate molecule by matching an observed mass value X
obtained by mass spectrometry with a predetermined
database; selecting an arbitrary number of candidate
molecules with high similarity scores; calibrating
the observed mass value X using the candidate molecule
as an internal standard; calculating relative error
Ec between a calibrated mass value Xc and a theoretical
mass value M of the candidate molecule; determining
the standard deviation S_Ec of the relative error;
determining a tolerance Tc of database search from the
standard deviation S_Ec; and repeating a database
search based on the tolerance Tc.
Description

Technical Field 

[0001] The present invention relates to a biopolymer 
identifying technology utilizing mass spectrometry, and 
more specifically, to a biopolymer automatic identifying 
method capable of improving the accuracy of mass data 
obtained by mass spectrometry. 

Background Art 

[0002] Mass spectrometry is an instrumental analysis 
technique whereby sample molecules are ionized and 
then separated in accordance with the mass/charge ra- 
tio (m/z) for detection. Using this technique, qualitative 
analysis can be performed based on the resultant mass 
spectrum, and quantitative analysis can be performed 
based on ion quantities. 

[0003] The mass spectrometer ("MS") used for such 
a measurement of molecular mass roughly consists of 
an ionization unit (ion source) for ionizing a sample, an 
analyzer for separating ions in accordance with the 
mass/charge ratio m/z (m: mass, and z: charge 
number), a detection unit (detector) for detecting sepa- 
rated ions, and a data analysis unit. 
[0004] When subjecting sample molecules to mass 
spectrometry using the aforementioned mass spec- 
trometer, the mass spectrometer must be calibrated pri- 
or to measurement. Specifically, since errors might be 
introduced into the measurement by the mass spec- 
trometer due to factors such as temperature changes, 
voltage accuracies, and electric circuit noise, a calibration
procedure must be carried out prior to the start of
measurement. In the calibration procedure, the chromatograph
or the like is removed from the mass spectrometer,
and a predetermined mass-calibration standard
substance is introduced into the mass spectrometer so 
as to obtain an observed mass value. The observed 
mass value is compared with a known theoretical mass 
value, and the apparatus is adjusted such that no sys- 
tematic error occurs in mass values (a calibration pro- 
cedure according to the external standard method). 
[0005] If an even higher accuracy of mass values is 
to be obtained, an additional calibration procedure must 
be performed, whereby a known substance is mixed in 
the sample and its mass is measured, and the actual 
measurement value is adjusted based on the mass val- 
ue (a calibration procedure according to the internal 
standard method). 

[0006] In general, identification of biopolymers, such
as peptides or proteins, using a mass spectrometer (in- 
cluding the tandem mass spectrometer) involves a pro- 
cedure referred to as a database search (or a library 
search). In this procedure, the observed mass value of
an unknown sample molecule obtained by mass spectrometry
is matched against a database (library) in which the
primary structures or sequences of approximately 100,000
kinds of molecules are stored. Based on expected reference
(standard) spectra calculated from the structure
information, molecules with a spectrum similar to that of
the unknown molecule under investigation are allocated
scores and selected. Candidate molecules are thus narrowed
and listed, thereby eventually identifying the unknown
sample molecule.
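
The matching step can be pictured as a tolerance window around each theoretical mass. The following is a minimal sketch only, assuming a toy in-memory database of (name, theoretical mass) pairs; the function name search_candidates and the example entries are hypothetical, and real search engines score whole fragment spectra rather than a single mass.

```python
def search_candidates(observed_mass, database, tolerance_ppm):
    """Return database entries whose theoretical mass lies within
    tolerance_ppm of the observed mass, closest match first."""
    hits = []
    for name, theoretical_mass in database:
        # Relative error of the observed mass, expressed in ppm.
        error_ppm = (observed_mass - theoretical_mass) / theoretical_mass * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            hits.append((name, theoretical_mass, error_ppm))
    hits.sort(key=lambda hit: abs(hit[2]))
    return hits


# Hypothetical entries (name, theoretical mass in Da), for illustration only.
toy_database = [("entry_A", 1000.000), ("entry_B", 1000.100), ("entry_C", 1200.000)]

wide = search_candidates(1000.020, toy_database, tolerance_ppm=250)
narrow = search_candidates(1000.020, toy_database, tolerance_ppm=50)
print([hit[0] for hit in wide])
print([hit[0] for hit in narrow])
```

With the wide window two entries survive as candidates, while the narrow window leaves only the closest one, which is why a tighter tolerance makes the identification less ambiguous.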
[0007] However, the above-described mass spectrometer
calibration procedure is very troublesome work, requires
much adjustment time, and is primarily responsible for
the drop in work efficiency caused by the conventional
mass measurement operation. Namely, it has been
impossible to carry out a measurement operation with
high efficiency based on a continuous operation of the
mass spectrometer (without calibration). Further, in a
measurement system employing a plurality of mass
spectrometers, it has been extremely difficult to
achieve uniform accuracy and reliability in the individual
apparatuses even if they are calibrated individually
according to the external standard.

[0008] In the case of the external standard calibration,
it has been impossible, using the conventional process
of database search as described above, to eliminate
from the measurement data the influence of erroneous
measurement in the mass spectrometer produced by
influences of the external environment. Particularly, even
those measurement errors due to subtle temperature
changes (on the order of 0.2°C) in the measurement
environment could not be ignored in some cases.

[0009] Furthermore, when a complex biopolymer mixture
is measured by the conventional internal standard
calibration method, the internal standard substance and
the ion signals from the sample are superposed, which
prevents ion analysis. Thus, it has been difficult to select
the type or concentration of the substance that is put
into the sample as the internal standard. In order to
achieve high mass accuracy for a wide range of masses,
it has been necessary to introduce a number of internal
standard substances.

[0010] Also, human confirmation of each identification
result has been necessary, as the identification reliability
has been low. Recent progress in mass spectrometry,
however, has made direct analysis of increasingly
complex biopolymer mixtures possible. This has resulted
in huge volumes of data that cannot possibly be
individually confirmed by the human eye. Therefore,
there has been a need to develop a highly reliable
automatic identification technique for the analysis of
complex biopolymer mixtures.

Disclosure of the Invention 

[0011] It is therefore an object of the invention to provide
a highly accurate and reliable method for automatically
identifying biopolymers that is based solely on data
processing and that eliminates the need for calibration
of the mass spectrometer prior to measurement or
the addition of an internal standard to the sample in
advance.

[0012] In order to achieve the aforementioned object,
the invention provides a biopolymer automatic identifying
method implementing the following procedures (1)-(7):

(1) a mass measurement procedure for measuring
the mass of a biopolymer in a sample by mass
spectrometry;
(2) a database search procedure for searching a
predetermined database for candidate molecules by
matching an observed mass value obtained by said
mass measurement procedure with the predetermined
database;
(3) a candidate molecule selection procedure for
selecting an arbitrary number of candidate molecules
having a high similarity score;
(4) a mass value calibration procedure for calibrating
the observed mass value using the candidate
molecules as an internal reference;
(5) a procedure for calculating relative error between
a calibrated mass value of a candidate molecule
obtained in a previous procedure and a theoretical
mass value in order to determine the standard
deviation of such relative error;
(6) a procedure for determining the tolerance
(allowable error) of the database search procedure
based on the standard deviation; and
(7) a procedure for repeating the database search
procedure on the basis of the tolerance.

The term "database" herein refers to a database of
molecular structures or sequences.

[0013] The mass value calibration procedure (4) may 
be a procedure in which relative error between an actual 
measurement value and a theoretical mass value of a 
candidate molecule selected by the candidate molecule 
selection procedure is calculated and a systematic error 
in the observed mass value is estimated by creating a 
least square line (a line expressed by the equation y =
a × M + b, where M is the theoretical mass value) based
on the plots of the theoretical mass value and the relative
error, and a procedure in which the observed mass
value is calibrated by subtracting the systematic error
from all of the measured values.
[0014] For example, in the case of a time-of-flight
mass spectrometer, the systematic error of a candidate
molecule is determined from the aforementioned least
square line. The systematic error is then subtracted from
all of the actual measurement values. Specifically, the
equation (Xc-M)/M = (X-M)/M-(aM+b), where X is an
observed mass value, Xc is a calibrated mass value, and
M is a theoretical mass value, is modified to
Xc = X-M(aM+b).

[0015] Although the theoretical mass value M is given
for the candidate molecule, it is not given to all of the
actual measurement values. Therefore, if all of the
actual measurement values are to be calibrated, the term
M(aM+b) in the above equation must be approximated
by an actual measurement value. The values of a and
b are generally much smaller than those of X and Xc,
such that M(aM+b) ≈ Xc(aX+b). Substituting this into the
above equation yields Xc = X-Xc(aX+b), which can be
modified to obtain Xc = X/(1+(aX+b)), based on which all
of the observed mass values can be calibrated.
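
The algebra behind this approximation can be written out step by step. The block below merely restates the equations of paragraphs [0014] and [0015], taking M ≈ Xc outside and M ≈ X inside the small correction term; no new quantities are introduced.

```latex
\begin{align*}
\frac{X_c - M}{M} &= \frac{X - M}{M} - (aM + b) \\
X_c &= X - M\,(aM + b) \\
X_c &\approx X - X_c\,(aX + b) \\
X_c\,\bigl(1 + (aX + b)\bigr) &\approx X \\
X_c &\approx \frac{X}{1 + (aX + b)}
\end{align*}
```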

[0016] In accordance with the biopolymer automatic
identifying method of the invention as described above,
very accurate mass values can be obtained from complex
biopolymer mixtures solely by data processing. The
high accuracy of the resultant mass values makes it
possible to identify and determine the biopolymers more
unambiguously. Thus, the invention provides a highly
reliable automatic identifying method capable of analyzing
complex biopolymer mixtures.

[0017] The invention also provides information recording
media, such as a CD-ROM, in which program
information for causing a computer system to carry out
the individual procedures constituting the above-described
biopolymer automatic identifying method is
stored.

[0018] The aforementioned means makes it possible
to eliminate the calibration operation of the mass
spectrometer prior to measurement and the addition of an
internal standard to the sample in advance. It also allows
the biopolymer automatic identifying method to be
implemented with high accuracy and reliability based
solely on data processing.

Brief Description of the Drawings 

[0019]

Fig. 1 shows the relationship between the mass value
(m/z) identified in Example 1 and error.
Fig. 2 shows the result of identification prior to mass
calibration in Example 2.
Fig. 3 shows the result of identification after mass
calibration in Example 2.
Fig. 4 shows the relationship between the mass value
(m/z) identified in Example 2 and error.

Best Mode for Carrying Out the Invention 

[0020] A preferred embodiment of the biopolymer
automatic identifying method in accordance with the
invention will be described. It should be obvious, however,
that the invention is not limited by the following
embodiment.

[0021] The mass of an unknown biopolymer in a sample
is initially measured by a conventional mass spectrometry
method depending on purpose, thereby obtaining
an observed mass value X. The mass spectrometry
method may employ a tandem mass spectrometer, for
example, which consists of a plurality of analyzers
coupled in tandem. Specifically, in the tandem mass
spectrometer, a particular ion (a parent ion) in a mixture is
selected by the initial analyzer, and a collision
dissociation is performed between the thus selected ion and
an inert gas in the next analyzer. Then, a dissociated
ion (generated ion) indicating the internal structure
information is subjected to mass spectrometry by the final
analyzer.

[0022] An observed mass value X obtained by the 
above mass measurement procedure is converted into 
a format (a binary file: mass value and intensity) that can
be read by conventional database search engines. The 
thus converted value is then matched with a database 
in which a number of molecules with known mass values 
are stored, so as to search for a candidate molecule that 
could possibly be the unknown biopolymer under inves- 
tigation. 

[0023] For the conversion of the observed mass value 
X, any of the generally available types of software pro- 
vided by the mass spectrometer manufacturers, such 
as MassLynx (from Micromass), may be appropriately 
utilized. The database search may be appropriately carried
out by using any commercially available database
software, such as Mascot (from Matrix Science). 
[0024] From the results of the database search pro- 
cedure, an arbitrary number of candidate molecules (or 
a set thereof) with high similarity scores are selected. 
The magnitude n of the set may be any number such 
that it renders statistical processing possible. 
[0025] Thereafter, the relative error E between the ob- 
served mass value X and its theoretical mass value M 
of each of the candidate molecules selected by the 
above candidate molecule selection procedure is calcu- 
lated in accordance with the following equation (1): 

E = (X-M)/M (1) 

[0026] A mean value m_E of the thus obtained relative
error E is then calculated in accordance with the following
equation (2):

m_E = Σ(E)/n (2)

[0027] The standard deviation S_E of the relative error
E is then calculated by the following equation (3):

S_E = (Σ(E-m_E)^2/(n-1))^(1/2) (3)

Using this standard deviation, it is determined whether
or not it is appropriate to use a particular candidate
molecule for the internal standard. When S_E < m_E, the
calibration is determined to be valid.
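
Equations (1) to (3) and the validity check can be sketched in a few lines. This is an illustration under stated assumptions: the helper names and the four candidate masses are hypothetical, and the check is applied exactly as written above (S_E < m_E), which presumes a positive mean error.

```python
import math


def relative_errors(observed, theoretical):
    """Equation (1): E = (X - M) / M for each candidate."""
    return [(x - m) / m for x, m in zip(observed, theoretical)]


def mean(values):
    """Equation (2): arithmetic mean."""
    return sum(values) / len(values)


def sample_std(values):
    """Equation (3): standard deviation with the (n - 1) denominator."""
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))


# Hypothetical candidates: theoretical masses and observed masses carrying
# relative errors of 150, 160, 170 and 180 ppm.
theoretical = [500.0, 1000.0, 1500.0, 2000.0]
errors_in = [150e-6, 160e-6, 170e-6, 180e-6]
observed = [m * (1 + e) for m, e in zip(theoretical, errors_in)]

E = relative_errors(observed, theoretical)
m_E = mean(E)
S_E = sample_std(E)
calibration_is_valid = S_E < m_E  # criterion stated in paragraph [0027]
print(m_E * 1e6, S_E * 1e6, calibration_is_valid)
```

Here the scatter (about 13 ppm) is far smaller than the mean offset (165 ppm), so the candidates would be accepted as an internal standard.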
[0028] The magnitude of the systematic error is then
estimated and subtracted from the observed mass value
X, thereby obtaining a calibrated mass value Xc. For
example, in the case of a time-of-flight mass spectrometer,
the systematic error of the candidate molecule can be
determined from the least square line y = aM+b with
respect to the plots of the theoretical mass value and the
relative error, in the following procedure. When the
relative error after the calibration of the candidate molecule
is Ec = (Xc-M)/M, Ec = E-(aM+b). Therefore:

(Xc-M)/M = (X-M)/M-(aM+b) (4)

where X is an observed mass value, Xc is a calibrated
mass value, and M is a theoretical mass value.
[0029] Specifically, the above equation (4) is modified
to obtain the following equation (5):

Xc = X-M(aM+b) (5)

[0030] It is noted that although the theoretical mass
value is given for the candidate molecule, it is not given
for all of the actual measurement values. Therefore, in
order to calibrate all of the actual measurement values,
the term "M(aM+b)" in the equation (5) must be
approximated by an actual measurement value. The values of
a and b are generally much smaller than those of X and
Xc, such that M(aM+b) ≈ Xc(aX+b). Substituting this into
equation (5) yields the following equation (6):

Xc = X-Xc(aX+b) (6)

[0031] This equation (6) is modified to obtain the
following equation (7):

Xc = X/(1+(aX+b)) (7)

based on which all of the observed mass values are
calibrated.
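
As a quick numerical check of equation (7), the sketch below calibrates one observed value and reports the residual error. The coefficients and the mass are hypothetical values chosen for illustration; a is taken as the coefficient multiplying the mass and b as the constant term of the line y = aM+b.

```python
def calibrate(observed_mass, a, b):
    """Equation (7): Xc = X / (1 + (a*X + b))."""
    return observed_mass / (1.0 + (a * observed_mass + b))


# Hypothetical systematic error: 150 ppm constant plus 0.01 ppm per Da.
a = 1e-8
b = 150e-6

M = 1000.0                      # theoretical mass (Da)
X = M * (1 + (a * M + b))       # observed mass carrying the systematic error
Xc = calibrate(X, a, b)

error_before_ppm = (X - M) / M * 1e6
error_after_ppm = (Xc - M) / M * 1e6
print(error_before_ppm, error_after_ppm)
```

The error before calibration is 160 ppm; the residual after calibration is on the order of a thousandth of a ppm, which is the price of replacing M by measured values in the correction term.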

[0032] The values of a and b in the aforementioned
least square line y = aM+b can be determined from the
following equations (8) and (9), respectively:

a = Σ{(M-m_M) × (E-m_E)} / Σ{(M-m_M)^2} (8)

b = m_E - a × m_M (9)

[0033] The value of m_M, which is the mean value of
the theoretical mass value M of the candidate molecule,
can be determined from the following equation (10):

m_M = Σ(M)/n (10)
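
The least square fit of relative error against theoretical mass can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name fit_error_line and the data are hypothetical, and the return values are named slope and intercept to make explicit that they play the roles of the coefficient multiplying M and the constant term in the line y = aM+b used in equations (5) to (7).

```python
def fit_error_line(theoretical, errors):
    """Ordinary least squares fit of errors against theoretical masses.

    Returns (slope, intercept) of the line y = slope * M + intercept.
    """
    n = len(theoretical)
    m_M = sum(theoretical) / n          # equation (10)
    m_E = sum(errors) / n               # equation (2)
    s_xy = sum((M - m_M) * (E - m_E) for M, E in zip(theoretical, errors))
    s_xx = sum((M - m_M) ** 2 for M in theoretical)
    slope = s_xy / s_xx
    intercept = m_E - slope * m_M
    return slope, intercept


# Hypothetical candidates whose error is exactly 150 ppm + 0.01 ppm per Da.
theoretical = [500.0, 1000.0, 1500.0, 2000.0]
errors = [1e-8 * M + 150e-6 for M in theoretical]

slope, intercept = fit_error_line(theoretical, errors)
print(slope, intercept)
```

Because the synthetic errors lie exactly on a line, the fit recovers the coefficients that generated them.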

[0034] The relative error Ec between the mass value
Xc after mass calibration and the theoretical mass value
M can be determined from the following equation (11):

Ec = E-(aM+b) (11)

[0035] Thereafter, the mean value m_Ec of the relative
error Ec = (Xc-M)/M obtained for the candidate molecule
and the standard deviation S_Ec are determined from the
following equations (12) and (13), respectively:

m_Ec = Σ(Ec)/n (12)

S_Ec = {Σ(Ec-m_Ec)^2/(n-1)}^(1/2) (13)

[0036] Based on the thus obtained mean value m_Ec,
the calibration is evaluated. Ideally, m_Ec = 0. Tolerance
Tc for a database search is then calculated based on
the standard deviation S_Ec, using the following equation
(14):

Tc = K × S_Ec (14)

where K is 1.5 to 3.0, thereby completing the
above-described series of calibration procedures.
[0037] In the above equation (14), K is an empirical 
constant for designating the confidence interval of the 
mass value. The K value can be appropriately deter- 
mined depending on the accuracy of the software used 
for the database search. The higher the identification 
performance of the database search software, the clos- 
er K can be to 3, where a 99.7% confidence interval can 
be obtained. In the case of Mascot (Matrix Science) database
software, K = 1.5 can be empirically employed.
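
Equations (12) to (14) amount to a scaled standard deviation. The following sketch assumes hypothetical residual errors in ppm and a hypothetical function name; the range check on K simply mirrors the 1.5 to 3.0 range given above.

```python
import math


def search_tolerance(calibrated_errors, k=1.5):
    """Equations (12)-(14): Tc = K * S_Ec, in the units of the input errors."""
    if not 1.5 <= k <= 3.0:
        raise ValueError("K is expected to lie between 1.5 and 3.0")
    n = len(calibrated_errors)
    m_ec = sum(calibrated_errors) / n                                  # (12)
    s_ec = math.sqrt(
        sum((e - m_ec) ** 2 for e in calibrated_errors) / (n - 1)     # (13)
    )
    return k * s_ec                                                    # (14)


# Hypothetical residual errors after calibration, in ppm.
residuals_ppm = [-10.0, 0.0, 10.0]
tc_mascot = search_tolerance(residuals_ppm, k=1.5)
tc_three_sigma = search_tolerance(residuals_ppm, k=3.0)
print(tc_mascot, tc_three_sigma)
```

For these residuals S_Ec is 10 ppm, so the tolerance is 15 ppm with K = 1.5 and 30 ppm with K = 3.0.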
[0038] Based on the resultant tolerance Tc (Tc1), the
same database search is conducted once again. As
needed, the above-described series of calibration and
database search procedures are repeated a plurality of
times so as to narrow the range of the tolerance Tc
(Tc1 → Tc2 → ...) gradually, thereby enhancing the
candidate molecule selection accuracy. Tc1 indicates
the tolerance obtained by the initial calibration
operation, and Tc2 indicates the tolerance obtained by the
second calibration operation.
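
The repeated search and calibration cycle can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the nearest-mass lookup stands in for a real search engine, the function names are hypothetical, and the data are a toy database with a 170 ppm offset plus a few ppm of scatter.

```python
import math


def fit_line(ms, es):
    """Least squares line es ~ a * ms + b; returns (a, b)."""
    n = len(ms)
    m_m, m_e = sum(ms) / n, sum(es) / n
    a = (sum((m - m_m) * (e - m_e) for m, e in zip(ms, es))
         / sum((m - m_m) ** 2 for m in ms))
    return a, m_e - a * m_m


def iterate_search(observed, database, tolerance, k=1.5, rounds=3):
    """Alternate database search and recalibration, narrowing the tolerance."""
    a = b = 0.0
    history = [tolerance]
    matches = []
    for _ in range(rounds):
        matches = []
        for x in observed:
            xc = x / (1.0 + (a * x + b))                  # equation (7)
            m = min(database, key=lambda t: abs(xc - t))  # toy "search"
            if abs(xc - m) / m <= tolerance:
                matches.append((x, m))
        if len(matches) < 3:
            break                                         # too few for statistics
        ms = [m for _, m in matches]
        es = [(x - m) / m for x, m in matches]            # equation (1)
        a, b = fit_line(ms, es)
        ecs = [e - (a * m + b) for e, m in zip(es, ms)]   # equation (11)
        m_ec = sum(ecs) / len(ecs)
        s_ec = math.sqrt(sum((e - m_ec) ** 2 for e in ecs) / (len(ecs) - 1))
        tolerance = k * s_ec                              # equation (14)
        history.append(tolerance)
    return a, b, history, matches


# Synthetic data: 170 ppm systematic offset with +/- 4 ppm scatter.
database = [500.0, 800.0, 1000.0, 1300.0, 1500.0, 2000.0]
scatter_ppm = [4, -4, 4, -4, 4, -4]
observed = [m * (1 + (170 + d) * 1e-6) for m, d in zip(database, scatter_ppm)]

a, b, history, matches = iterate_search(observed, database, tolerance=250e-6)
print([round(t * 1e6, 2) for t in history])
```

On this data the tolerance falls from 250 ppm to a few ppm after the first cycle and then stays there, since later cycles refit the same set of matches.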

[0039] In this way, the accuracy of candidate molecule 
identification can be enhanced. Namely, the accuracy 
of identification of unknown sample molecules can be 
improved. 

[0040] The above-described procedures can be ren- 
dered into desired computer program information which 
can then be stored in various forms of information re- 
cording media, such as CD-ROMs, Floppy™ discs, or 
other forms of computer hardware, such as servers. In 
this way, the program can be executed on a desired 
computer system or a computer network (via informa- 
tion and communications technology). 
EXAMPLES

[0041] The time-of-flight mass spectrometer is an
apparatus for measuring the time it takes for an ion to travel
a certain distance L in order to measure its mass
according to the relationship between the mass m and the
time of flight T expressed by the following equation (15):

T = L·(2eV)^(-1/2)·(m/z)^(1/2) (15)

where e is the elementary charge and z is the charge
number.

[0042] The mass measurement accuracy of this
apparatus depends on L and the acceleration voltage V.
L, which is an inherent value of the apparatus, may
fluctuate due to temperature-caused expansions or
contractions. V may fluctuate due to the drift in the supply
voltage. Depending on the measurement conditions,
these fluctuations may cause a systematic mass error of
100 ppm or more. However, variations among mass
errors (which reflect the performance of the mass
spectrometer) are relatively small as compared with the
mean value of the systematic error. By taking advantage
of this fact, the systematic error can be exclusively
eliminated.
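
Equation (15) can be used to estimate how drifts in L and V translate into a mass error. The sketch below is an assumption-laden illustration, not part of the original disclosure: it supposes that the instrument converts flight time to m/z with nominal values of L and V while the actual values have drifted, so that the apparent m/z is scaled by (L/L0)^2 · (V0/V). The function name is hypothetical.

```python
def apparent_mass_error_ppm(length_drift_ppm, voltage_drift_ppm):
    """Relative m/z error caused by drift in flight length L and voltage V.

    From equation (15), m/z = 2eV * T^2 / L^2. If T is produced by the
    actual L and V but converted with the nominal L0 and V0, the reported
    m/z is scaled by (L / L0)^2 * (V0 / V).
    """
    length_ratio = 1.0 + length_drift_ppm * 1e-6    # L / L0
    voltage_ratio = 1.0 + voltage_drift_ppm * 1e-6  # V / V0
    return (length_ratio ** 2 / voltage_ratio - 1.0) * 1e6


# A 50 ppm thermal expansion of the flight path alone.
error_from_length = apparent_mass_error_ppm(50.0, 0.0)
# A 100 ppm sag in the acceleration voltage alone.
error_from_voltage = apparent_mass_error_ppm(0.0, -100.0)
print(error_from_length, error_from_voltage)
```

Either drift alone yields an error of roughly 100 ppm, and under this model the error does not depend on m/z, which is consistent with a systematic offset that is large compared with the scatter between ions.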

[0043] In the following, an example in which identifi- 
cation accuracy has been improved by the method of 
the invention will be described. 


(Example 1) 

[0044] One hundred fmol of tryptic digest of human
serum albumin was measured by HPLC-MS/MS, and a
database search was conducted by MS/MS ions search
using the commercially available Mascot database
search software (search parameters: peptide tolerance
250 ppm; MS/MS tolerance 0.5 Da).
[0045] Based on the search results, the relative error
E ((X-M)/M, in ppm) with respect to the theoretical m/z
identified for the 20 ions with the highest scores was
determined. The relative error E was then plotted with respect
to the theoretical m/z, as shown in Fig. 1. As shown, the
mean value of the original relative error E (indicated by
♦) was approximately 170 ppm, whereas the variations
in E were within the 150-175 ppm range, which are
smaller than the value of E per se.
[0046] The mass was calibrated by finding a least
square line with respect to this group of ions and then
subtracting it from the error in each ion. The relative
error Ec after calibration (indicated by ■ in Fig. 1) was
similarly plotted, as shown in Fig. 1. The database search
parameters determined from the variations in Ec
(represented by the standard deviation) were such that the
peptide tolerance was 18 ppm and the MS/MS tolerance
was 0.080 Da. Thus, the mass calibration allowed the
tolerances in a search to be reduced from 250 to 18 ppm
and from 0.5 to 0.080 Da; namely, by a factor of
approximately 14 and 6, respectively, thereby enhancing the
identification reliability.
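
The reduction factors quoted above can be checked with simple arithmetic, using the initial search parameters of paragraph [0044] (250 ppm and 0.5 Da). The implied standard deviation in the last line is an inference, assuming K = 1.5 was used as suggested for Mascot in paragraph [0037]; it is not a figure reported in the example.

```python
# Search tolerances before and after calibration, as quoted in Example 1.
peptide_before_ppm, peptide_after_ppm = 250.0, 18.0
msms_before_da, msms_after_da = 0.5, 0.080

peptide_factor = peptide_before_ppm / peptide_after_ppm
msms_factor = msms_before_da / msms_after_da

# Assumption: Tc = K * S_Ec with K = 1.5, so S_Ec = Tc / K.
implied_s_ec_ppm = peptide_after_ppm / 1.5

print(round(peptide_factor, 1), round(msms_factor, 2), implied_s_ec_ppm)
```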

(Example 2) 

[0047] The following shows that erroneous identifica- 
tion can actually be corrected by the mass calibration 
method of the invention. 

[0048] A peptide SRLDQELK, which is known to be 
liable to erroneous identification during a database 
search based on mass data, was synthesized in a con- 
ventional manner. One hundred fmol of the peptide was 
then mixed with 100 fmol of the aforementioned tryptic 
digest of human serum albumin, and a similar experi- 
ment was conducted. Under the conventional search 
conditions (with search parameters of peptide tolerance 
250 ppm and MS/MS tolerance 0.5 Da), the synthetic 
peptide was erroneously identified, as shown in Fig. 2. 
[0049] When the above-described mass calibration 
was performed, the peptide was correctly identified, as 
shown in Fig. 3. 

[0050] Each ion in the MS/MS spectrum of the peptide 
was assigned to a theoretical product ion (b and y ion 
sequences) of each peptide (EKLTQELK and 
SRLDQELK) that had been identified, and its systematic 
error was plotted with respect to the m/z, as shown in 
Fig. 4. In the case of SRLDQELK (indicated by ♦ in Fig. 
4), the relative error of all of the ions was within a narrow 
range, whereas in the case of EKLTQELK (indicated by
■ in Fig. 4), the plots exhibited two different distributions.
Thus, by improving the mass accuracy by data processing,
it became possible to correctly distinguish and identify
peptides with similar masses and with identical
sequences in the C-terminal portion.
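
How close the two peptide masses are can be estimated from standard monoisotopic residue masses. This calculation does not appear in the original text; it is a sketch using commonly tabulated residue masses rounded to five decimals, with hypothetical helper names, and it considers only the neutral peptide mass rather than the fragment ions plotted in Fig. 4.

```python
# Monoisotopic residue masses (Da) for the amino acids that occur in the
# two peptides, plus one water for the free termini.
RESIDUE_MASS = {
    "S": 87.03203, "R": 156.10111, "L": 113.08406, "D": 115.02694,
    "Q": 128.05858, "E": 129.04259, "K": 128.09496, "T": 101.04768,
}
WATER = 18.01056


def peptide_mass(sequence):
    """Monoisotopic mass of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


mass_correct = peptide_mass("SRLDQELK")
mass_wrong = peptide_mass("EKLTQELK")
difference_ppm = (mass_wrong - mass_correct) / mass_correct * 1e6

print(round(mass_correct, 4), round(mass_wrong, 4), round(difference_ppm, 1))
```

Under these assumptions the two peptides differ by roughly 25 ppm, so both fall inside a 250 ppm search window, whereas a tolerance of the order of the 18 ppm obtained in Example 1 would separate them.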

INDUSTRIAL APPLICABILITY 

[0051] In accordance with the invention, the calibra- 
tion operation of the mass spectrometer prior to meas- 
urement, or the addition of an internal standard to a sam- 
ple, can be eliminated, thereby enabling continuous op- 
eration of the mass spectrometer (without interruption 
by calibration operations). As a result, operators are 
freed from the burden of equipment adjustment, such 
that the efficiency of the molecule identification opera- 
tion can be improved. 

[0052] Furthermore, the influence of error inherent in 
a mass spectrometer can be eliminated, and a highly 
accurate and reliable biopolymer automatic identifying 
method can be implemented based solely on data 
processing. In a measurement system employing a plu- 
rality of mass spectrometers, uniform data accuracy can 
be obtained in individual mass spectrometers, thereby 
reliably preventing the erroneous identification of an un- 
known sample molecule. 



Claims 

1. A biopolymer automatic identifying method comprising:

a mass measurement procedure for measuring
the mass of a biopolymer in a sample by mass
spectrometry;
a database search procedure for retrieving a
candidate molecule by matching an observed
mass value obtained by said mass measurement
procedure with a predetermined database;
a candidate molecule selection procedure for
selecting an arbitrary number of candidate
molecules with a high similarity score;
a mass value calibration procedure for calibrating
the observed mass value using the candidate
molecules as an internal standard;
a procedure for calculating relative error between
a calibrated mass value of a candidate
molecule obtained by a previous procedure and
a theoretical mass value, and for determining
the standard deviation of said relative error;
a procedure for determining the tolerance of
said database search procedure from said
standard deviation; and
a procedure for repeating said database search
procedure based on said tolerance.

2. The biopolymer automatic identifying method
according to claim 1, wherein said mass value
calibration procedure comprises:

calculating relative error between an observed
mass value and a theoretical mass value of a
candidate molecule selected in said candidate
molecule selection procedure;
estimating a systematic error of the observed
mass value by creating a least square line with
respect to a plot of the theoretical mass value
and the relative error; and
calibrating the observed mass value by subtracting
the systematic error from all of the actual
measurement values.

3. An information recording medium in which program
information for causing a computer system to carry
out the individual procedures making up said
biopolymer automatic identifying method according
to claim 1 or 2 is stored.